There are 28 examples that no model solves.
If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
CRUXEval-output/112, CRUXEval-output/113, CRUXEval-output/129, CRUXEval-output/149, CRUXEval-output/163, CRUXEval-output/175, CRUXEval-output/177, CRUXEval-output/218, CRUXEval-output/229, CRUXEval-output/245, CRUXEval-output/250, CRUXEval-output/254, CRUXEval-output/259, CRUXEval-output/272, CRUXEval-output/280, CRUXEval-output/301, CRUXEval-output/307, CRUXEval-output/33, CRUXEval-output/340, CRUXEval-output/375, CRUXEval-output/445, CRUXEval-output/44, CRUXEval-output/469, CRUXEval-output/488, CRUXEval-output/581, CRUXEval-output/622, CRUXEval-output/640, CRUXEval-output/671
In the table below, `model` is the lowest-rated model that solves each example and `min_elo` is that model's Elo rating.
| example_link | model | min_elo |
|---|---|---|
| CRUXEval-output/444 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/599 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/169 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/484 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/698 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/125 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/591 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/126 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/458 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/35 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/220 | gpt-4-turbo-2024-04-09+cot | 1508.026 |
| CRUXEval-output/501 | claude-3-opus-20240229+cot | 1489.546 |
| CRUXEval-output/317 | claude-3-opus-20240229+cot | 1489.546 |
| CRUXEval-output/726 | claude-3-opus-20240229+cot | 1489.546 |
| CRUXEval-output/391 | claude-3-opus-20240229+cot | 1489.546 |
| CRUXEval-output/158 | claude-3-opus-20240229+cot | 1489.546 |
| CRUXEval-output/631 | gpt-4-0613+cot | 1392.187 |
| CRUXEval-output/5 | gpt-4-0613+cot | 1392.187 |
| CRUXEval-output/310 | gpt-4-0613+cot | 1392.187 |
| CRUXEval-output/799 | gpt-4-0613+cot | 1392.187 |
| CRUXEval-output/438 | gpt-4-0613 | 1283.246 |
| CRUXEval-output/556 | gpt-4-turbo-2024-04-09 | 1267.174 |
| CRUXEval-output/211 | gpt-4-turbo-2024-04-09 | 1267.174 |
| CRUXEval-output/550 | gpt-3.5-turbo-0613+cot | 1116.281 |
| CRUXEval-output/568 | gpt-3.5-turbo-0613+cot | 1116.281 |
| CRUXEval-output/749 | codellama-34b+cot | 884.835 |
| CRUXEval-output/499 | mixtral-8x7b | 855.530 |
| CRUXEval-output/347 | codellama-7b+cot | 644.964 |
| CRUXEval-output/571 | codellama-7b+cot | 644.964 |
| CRUXEval-output/209 | phi-1 | 610.177 |
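
As a rough illustration of how the unsolved list and the `min_elo` column could be derived, here is a minimal sketch. It assumes two hypothetical inputs that this report does not show: a mapping from model name to Elo rating, and a mapping from model name to the set of examples that model solves. The function name, model names, example IDs, and ratings are all made up for the toy data.

```python
from typing import Dict, List, Optional, Set, Tuple


def min_elo_per_example(
    elo: Dict[str, float],
    solved: Dict[str, Set[str]],
    examples: List[str],
) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each example to (lowest-rated solver, its Elo), or None if unsolved."""
    result: Dict[str, Optional[Tuple[str, float]]] = {}
    for ex in examples:
        # Every model whose solved set contains this example.
        solvers = [m for m in elo if ex in solved.get(m, set())]
        if solvers:
            weakest = min(solvers, key=lambda m: elo[m])
            result[ex] = (weakest, elo[weakest])
        else:
            result[ex] = None  # no model solves this example
    return result


# Toy data (hypothetical, not taken from the report).
elo = {"model_a": 1500.0, "model_b": 1100.0, "model_c": 600.0}
solved = {"model_a": {"ex/1", "ex/2"}, "model_b": {"ex/2"}, "model_c": set()}
examples = ["ex/1", "ex/2", "ex/3"]

table = min_elo_per_example(elo, solved, examples)
unsolved = [ex for ex, entry in table.items() if entry is None]
```

With the toy data, `ex/1` is solved only by `model_a`, so its minimum Elo is that model's rating, while `ex/3` ends up in the unsolved list.
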
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).
| example_link | acc | tau |
|---|---|---|
| CRUXEval-output/329 | 0.686 | -0.358 |
| CRUXEval-output/563 | 0.800 | -0.322 |
| CRUXEval-output/333 | 0.400 | -0.301 |
| CRUXEval-output/297 | 0.371 | -0.286 |
| CRUXEval-output/691 | 0.314 | -0.262 |
| CRUXEval-output/118 | 0.514 | -0.258 |
| CRUXEval-output/132 | 0.457 | -0.245 |
| CRUXEval-output/209 | 0.029 | -0.239 |
| CRUXEval-output/57 | 0.629 | -0.238 |
| CRUXEval-output/638 | 0.114 | -0.236 |
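
The report does not say how `tau` is computed. The sketch below assumes it is Kendall's tau-b between each model's pass/fail result on one problem and that model's overall score. Both that choice and the toy numbers are assumptions for illustration, not the report's actual method. The implementation is pure Python so that it stays self-contained.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, which corrects for ties (pass/fail data has many)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: one entry per model (hypothetical values).
passed = [1, 1, 0, 0]             # pass/fail on a single problem
overall = [0.2, 0.3, 0.6, 0.7]    # overall score of the same models

acc = sum(passed) / len(passed)   # per-problem accuracy, as in the `acc` column
tau = kendall_tau_b(passed, overall)
```

Here the two weakest models pass and the two strongest fail, so `tau` comes out negative, which is the pattern the table above flags.
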
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.